Python for Text Similarities 1

Download: This and various other Jupyter notebooks are available from my GitHub repo.

Version: 1.1, September 2019

License: Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)

This is a tutorial related to the discussion of text similarities in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.

This tutorial was developed as part of the material for my course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.

Jaccard coefficient

To calculate the Jaccard coefficient we prepare two texts:



In [13]:

    
text1 = """Our medicine cures baldness. No diagnostics needed.
           We guarantee Fast Viagra delivery.
           We can provide Human growth hormone. The cheapest Life
           Insurance with us. You can Lose weight with this treatment.
           Our Medicine now and No medical exams necessary.
           Our Online pharmacy is the best.  This cream Removes
           wrinkles and Reverses aging.
           One treatment and you will Stop snoring.  We sell Valium
           and Viagra.
           Our Vicodin will help with Weight loss. Cheap Xanax."""
text2 = """Dear ,
           we sell the cheapest and best Viagra on the planet. Our delivery is
           guaranteed confident and cheap.
        """

We import the word_tokenize function from the NLTK module and convert the token list of each text to a set of types.



In [14]:

    
from nltk import word_tokenize

types1 = set(word_tokenize(text1))
types2 = set(word_tokenize(text2))

The types in the first text are:



In [15]:

    
print(types1)

{'aging', 'and', 'Vicodin', 'can', 'Human', 'Medicine', 'needed', 'hormone', 'now', 'Removes', 'Lose', 'with', 'you', 'Xanax', 'Reverses', 'Weight', 'sell', 'Life', 'You', 'Valium', 'medical', 'is', 'snoring', 'baldness', 'Viagra', 'Fast', 'necessary', 'This', 'No', 'cream', 'provide', '.', 'us', 'One', 'will', 'delivery', 'growth', 'weight', 'exams', 'wrinkles', 'the', 'help', 'best', 'Our', 'We', 'medicine', 'The', 'guarantee', 'Insurance', 'Online', 'Stop', 'cheapest', 'treatment', 'pharmacy', 'cures', 'diagnostics', 'Cheap', 'this', 'loss'}

We can generate the intersection of the two sets of types in the following way:



In [16]:

    
print(set.intersection(types1, types2))

{'cheapest', 'sell', 'delivery', 'and', 'is', 'the', 'Viagra', 'best', 'Our', '.'}
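
The same operations can also be written with Python's set operators. The following is a minimal sketch on two small hand-made sets; the names `a`, `b`, `common`, and `combined` are illustrative assumptions and not part of the tutorial's data:

```python
# Two small illustrative sets of types (hypothetical data, not the texts above).
a = {"we", "sell", "cheap", "Viagra", "."}
b = {"we", "guarantee", "cheap", "delivery", "."}

# The & operator is equivalent to set.intersection(a, b).
common = a & b
# The | operator is equivalent to set.union(a, b).
combined = a | b

# Sets are unordered, so sort the types for a stable display order.
print(sorted(common))
print(len(common), len(combined))
```

Both spellings return a new set and leave the operands unchanged, so the choice between them is mostly a matter of readability.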

To calculate the Jaccard coefficient, we divide the size of the intersection of the two sets of types by the size of their union:



In [17]:

    
lenIntersect = len(set.intersection(types1, types2))
lenUnion = len(set.union(types1, types2))

print(lenIntersect / lenUnion)

0.14925373134328357

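This division is equivalent to $\frac{\text{types in both sets}}{(\text{types in set 1}) + (\text{types in set 2}) - (\text{types in both sets})}$, because the size of the union is the sum of the two set sizes minus the size of the intersection, which would otherwise be counted twice. For the two texts above this gives $\frac{10}{59 + 18 - 10} = \frac{10}{67} \approx 0.149$.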
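
The calculation can be wrapped in a small reusable function. This is a sketch under stated assumptions rather than a definitive implementation: the function name `jaccard`, the example sets, and the choice to return 0.0 when both sets are empty are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections of types."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having no overlap,
        # which also avoids a division by zero.
        return 0.0
    return len(a & b) / len(union)


# Small illustrative sets (hypothetical data, not the tutorial's texts).
a = {"we", "sell", "cheap", "Viagra", "."}
b = {"we", "guarantee", "cheap", "delivery", "."}

# The union size equals the two set sizes minus the shared types.
both = len(a & b)
assert len(a | b) == len(a) + len(b) - both

# 3 shared types out of 7 types in the union.
print(jaccard(a, b))
```

With the sets from the tutorial, `jaccard(types1, types2)` would perform the same division as the cell above.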